Improving Entity Resolution with Global Constraints

نویسندگان

  • Jim Gemmell
  • Benjamin I. P. Rubinstein
  • Ashok K. Chandra
چکیده

Some of the greatest advances in web search have come from leveraging socio-economic properties of online user behavior. Past advances include PageRank, anchor text, hubsauthorities, and TF-IDF. In this paper, we investigate another socio-economic property that, to our knowledge, has not yet been exploited: sites that create lists of entities, such as IMDB and Netflix, have an incentive to avoid gratuitous duplicates. We leverage this property to resolve entities across the different web sites, and find that we can obtain substantial improvements in resolution accuracy. This improvement in accuracy also translates into robustness, which often reduces the amount of training data that must be labeled for comparing entities across many sites. Furthermore, the technique provides robustness when resolving sites that have some duplicates, even without first removing these duplicates. We present algorithms with very strong precision and recall, and show that max weight matching, while appearing to be a natural choice turns out to have poor performance in some situations. The presented techniques are now being used in the back-end entity resolution system at a major Internet search engine.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Corpus based coreference resolution for Farsi text

"Coreference resolution" or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreference when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

متن کامل

An Optimal Approach to Local and Global Text Coherence Evaluation Combining Entity-based, Graph-based and Entropy-based Approaches

Text coherence evaluation becomes a vital and lovely task in Natural Language Processing subfields, such as text summarization, question answering, text generation and machine translation. Existing methods like entity-based and graph-based models are engaging with nouns and noun phrases change role in sequential sentences within short part of a text. They even have limitations in global coheren...

متن کامل

Exploring Local and Global Semantic Information for Event Pronoun Resolution

Event anaphora resolution plays a critical role in discourse analysis. This paper focuses on improving event pronoun resolution using both local and global semantic information. In particular, a predicate-argument structure is proposed to represent the local semantic information about an event while the global semantic information is represented by the entity coreference chains related with var...

متن کامل

The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...

متن کامل

A Joint Inference Architecture for Global Coreference Clustering with Anaphoricity

We present an architecture for coreference resolution based on joint inference over anaphoricity and coreference, using Markov Logic Networks. Mentions are discriminatively clustered with discourse entities established by an anaphoricity classifier. Our entity-based coreference architecture is realized in a joint inference setting to compensate for erroneous anaphoricity classifications and avo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1108.6016  شماره 

صفحات  -

تاریخ انتشار 2011